feat: add cloudflare-metrics worker for graphql analytics export (#28)
zackpollard merged 109 commits into main
Conversation
Preview Deployments (01b5219)
Deployment status
CI has deployed the worker + dashboard to dev (PR-28 stage): …
The worker is currently no-op on every cron tick because the analytics API token secret isn't wired up yet (a sketch of the gate is at the end of this comment).

Follow-up to start data collection
To make the collector start emitting data to VictoriaMetrics/Grafana: …
Why the token isn't generated via Terraform
The cleanest approach would be …

Integration test coverage
…
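The no-op behavior described above presumably amounts to a guard like the following. A minimal sketch, assuming the token is bound as CLOUDFLARE_API_TOKEN (the same variable name the integration tests use) and a hypothetical collect() entry point:

```ts
// Minimal sketch of the cron gate (types from @cloudflare/workers-types).
interface Env {
  CLOUDFLARE_API_TOKEN?: string; // analytics read token, bound as a secret
  CLOUDFLARE_ACCOUNT_ID: string;
}

// Hypothetical entry point; the real collector lives in src/collector.ts.
declare function collect(env: Env): Promise<void>;

export default {
  async scheduled(_controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    if (!env.CLOUDFLARE_API_TOKEN) {
      // Secret not provisioned yet: log and skip instead of throwing,
      // so the cron tick shows up as ok rather than errored.
      console.log("cloudflare-metrics: no API token configured, skipping tick");
      return;
    }
    ctx.waitUntil(collect(env));
  },
};
```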
Force-pushed ef0a947 to 7c7d1d7
✅ Pipeline is now live end-to-end

Terraform now owns the API token
Following the devtools …

One wrinkle: Cloudflare provider v5 has a bug (#5045) where …

The real bug
What took most of the debugging: my first version of the GraphQL client had … Fix: look up …

Verified working
Latest cron tick at …
That's the full 20-dataset collection making it through end-to-end. The dev Grafana …
✅ Resource name enrichment (D1, queues, zones)
Both the …

If any of the three lookups fails, it's reported via a …

Token permissions
The …

Verified on …
Zones: per-tag lookup for Pages projects

Root cause: …

Fix
…
Verified on the 17:55 tick
Dashboard legends should now show real hostnames like …
✅ 1-minute granularity live, batched, and working

Final results on the …
New datasets shipped
Granularity
Dropped from …

Batching architecture
Three design changes in …
Force-pushed e0a6644 to 22b8202
…ng it via terraform
…ng bootstrap provider
…vive provider refresh
…e after v5 state wipe
29999 was the wrong workaround — the real issue was the bundled usage_model, not the value being out of range. 30000 works fine.
The curl was using -f, which fails the whole terraform apply if the PATCH returns any HTTP error. The service-env settings are sticky once set, so we don't actually need to re-PATCH on every deploy — make it best-effort and just log the response. Also forces a fresh isolate via the new version, unblocking the worker, which has been hitting CPU exceedances on a long-lived isolate.
Captures all console.log and runtime traces in Cloudflare's Observability dashboard so we can see what was happening during CPU-exceeded incidents (logs are otherwise lost when the invocation is killed). 100% sampling so we don't miss anything during debugging; can lower the head_sampling_rate later if log volume becomes a concern.
Observability data shows Cloudflare drops/delays about 25% of our cron triggers. When it recovers it fires the missed triggers as a burst (observed up to 10 at once) all at the same wall clock. With Date.now() as the query anchor, every catch-up invocation ran the exact same query and the originally-scheduled minutes were lost.

Fix by anchoring the query window to controller.scheduledTime so each catch-up invocation queries the window it was originally scheduled for. Also bumps DEFAULT_WINDOW_MS from 3 → 5 min so each minute is covered by ~6 consecutive ticks instead of 4 — with a 25% miss rate the probability of all 6 ticks missing drops from 0.4% to 0.024%, which should essentially eliminate the gaps. VictoriaMetrics dedupes on (series, timestamp) so the extra overlap is free on the storage side.
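A minimal sketch of the anchoring change, assuming a collect() entry point that takes an explicit time window (the constant name DEFAULT_WINDOW_MS is from the commit; the collector signature is an assumption):

```ts
// Anchor the query window to the tick's *scheduled* minute rather than
// wall-clock now, so burst-delivered catch-up invocations each re-query
// the window they were originally scheduled for.
const DEFAULT_WINDOW_MS = 5 * 60 * 1000; // bumped from 3 min per the commit

interface Env {
  CLOUDFLARE_API_TOKEN: string;
  CLOUDFLARE_ACCOUNT_ID: string;
}

// Hypothetical collector entry point taking an explicit time window.
declare function collect(env: Env, window: { start: Date; end: Date }): Promise<void>;

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    // controller.scheduledTime is the epoch-ms the tick was supposed to
    // fire, even when Cloudflare delivers it late in a catch-up burst.
    const end = new Date(controller.scheduledTime);
    const start = new Date(controller.scheduledTime - DEFAULT_WINDOW_MS);
    ctx.waitUntil(collect(env, { start, end }));
  },
};
```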
Two bugs: (1) sum by (status) — the metric's label is http_response_status, not status. Everything was collapsing into a single empty-key series. (2) rate() on what's effectively a per-minute sum gauge, not a monotonic counter — the raw sample values are independent per minute, so rate's counter-reset handling isn't meaningful. Changed to a direct per-minute sum grouped by http_response_status. If still gappy after this, investigate whether the underlying samples are actually missing in VM.
Our collector emits per-minute gauge values (sum of requests/errors/ subrequests within each minute), not monotonic counters. rate() on these produces display artifacts — apparent gaps when the per-second derivative can't be computed meaningfully between non-monotonic samples. Switching to raw metric display shows the actual per-minute counts directly. Fixes reported gaps in subrequests-per-script for version-api-prod between 07:17-07:21 where Cloudflare source data confirmed all minutes had ~3000 subrequests and all cron ticks ran successfully.
All self-telemetry metrics are per-tick gauges emitted every minute. rate() is wrong on these (not counters), and 5m/10m increase() windows are unnecessarily wide for 1-minute data. Changed:

- rate(metric[5m]) → raw metric (rows, points, lookup counts, HTTP)
- increase(metric[5m]) → increase(metric[1m]) (errors, flush errors)
- increase(metric[10m]) → increase(metric[1m]) (cron exceptions)
Per-minute gauge data displayed as disconnected points looks like random spikes. Set spanNulls=true and lineInterpolation=smooth on all 9 timeseries panels so data points connect into a continuous trend line.
Comprehensive fix across all 19 dashboards:

- 84 rate(cf_...[Xm]) → raw metric: our collector emits per-minute gauge values, not monotonic counters. rate() on these produced erratic spikes whenever traffic dropped between minutes (interpreted as counter resets). Raw metric display shows actual per-minute counts.
- 2 increase() windows narrowed from 5m/10m → 1m to match our 1-minute cron interval.
- 132 timeseries panels styled: spanNulls=true, lineInterpolation=smooth, showPoints=never. Per-minute data points now connect into smooth trend lines instead of appearing as disconnected spikes.
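At that scale the change is mechanical; whether or not it was scripted, the transform amounts to something like the sketch below, assuming Grafana's dashboard JSON layout (the Panel/Target shapes are abbreviated and this script is not part of the PR):

```ts
// Sketch of the bulk rewrite: unwrap rate(cf_...[Xm]) to the raw per-minute
// gauge, and style every timeseries panel so points connect into a line.
interface Target { expr?: string }
interface Panel {
  type: string;
  targets?: Target[];
  fieldConfig?: { defaults: { custom?: Record<string, unknown> } };
}

function rewritePanel(panel: Panel): void {
  for (const t of panel.targets ?? []) {
    if (t.expr) {
      // rate(cf_foo{...}[5m]) -> cf_foo{...}  (per-minute gauges, not counters)
      t.expr = t.expr.replace(/rate\((cf_[a-z0-9_]+(?:\{[^}]*\})?)\[\d+m\]\)/g, "$1");
    }
  }
  if (panel.type === "timeseries" && panel.fieldConfig) {
    panel.fieldConfig.defaults.custom = {
      ...panel.fieldConfig.defaults.custom,
      spanNulls: true,
      lineInterpolation: "smooth",
      showPoints: "never",
    };
  }
}
```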
…lo writes

Cloudflare routes cron triggers to multiple colos simultaneously. Each colo queries the GraphQL analytics API and writes to VictoriaMetrics. Due to eventual consistency, different colos can return different aggregation counts for the same minute — one colo might see 3000 subrequests while another only sees 450 (partial data). VM's last-write-wins causes the stored value to oscillate between the competing writes, producing 6-7x swings on the dashboard.

Fix by wrapping all 128 timeseries metric queries with max_over_time(metric[2m]). This takes the highest value for each sub-series over a 2-minute window, ensuring the most complete colo's data wins regardless of write order. For sum/count metrics, max picks the most complete data. For max/p99 metrics, max is also correct. Instant queries (stat panels using increase([24h])) are excluded since they aggregate over long windows where the oscillation averages out.
Our metrics are per-minute gauges (value = count within that minute), not monotonic counters. increase() computes (last - first), which is meaningless for gauges — it could return near-zero for "total requests in 24h" even when there were millions. Changed:

- 65 stat panels: increase(metric[24h]) → sum_over_time(max_over_time(metric[1m])[24h:1m]). Correctly totals all per-minute values over the window.
- 20 billing queries: increase(metric[1h]) → sum_over_time(metric[1h:1m]). Correctly computes hourly cost from per-minute data.
- 6 self-telemetry queries: increase(metric[1m]) → raw metric or sum_over_time.

Zero rate() or increase() remaining across all 19 dashboards. Every cf_ metric query now uses the appropriate *_over_time wrapper:

- max_over_time: timeseries (multi-colo dedup)
- sum_over_time: totals (stat panels, billing, error tables)
- last_over_time: storage snapshots (carry forward)
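Spelled out with one example per wrapper (the measurement names are from the dataset list in the PR description; the field suffixes, label matchers, and lookback windows are illustrative):

```ts
// Illustrative PromQL kept as strings; only the wrapper pattern is the point.

// Timeseries panels: dedupe competing multi-colo writes by taking the most
// complete value for each series over the last 2 minutes.
const timeseriesExpr =
  'max_over_time(cf_workers_invocations_requests{script_name="version-api-prod"}[2m])';

// Stat panel totals: dedupe per minute first, then sum every per-minute
// value across the window with a 1m-step subquery.
const dayTotalExpr =
  'sum_over_time(max_over_time(cf_workers_invocations_requests[1m])[24h:1m])';

// Storage snapshots: carry the last reported gauge value forward.
const storageExpr = 'last_over_time(cf_r2_storage_bytes[2m])';
```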
Panels were labeled with per-second units (reqps, ops, Bps) from when queries used rate(). Now that we display raw per-minute gauge values:

- 42 panels: reqps → short (per-minute count, not per-second)
- 9 panels: ops → short (same)
- 4 panels: Bps → bytes (per-minute byte total, not bytes/sec)
- 13 panel titles: removed "Rate" since we show counts not rates
- Guard division by zero in scheduled worker CPU avg (collector.ts)
- Skip NaN/Infinity values in InfluxDB line protocol serialization
- Remove dead applyResourceTags() function from emit.ts
- Use max() instead of sum() in alert PromQL for multi-colo safety
- Improve curl PATCH logging in worker.tf to surface failures
- Add test suites: emit, metric-providers (escaping + NaN), flush-state, resource-cache, scheduled handler window logic
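Sketches of the first two guards; the function and parameter names here are assumptions, not the actual collector.ts/emit.ts code:

```ts
// Guard the scheduled-worker CPU average against zero invocations.
function cpuAvgMs(totalCpuTimeUs: number, invocations: number): number | undefined {
  return invocations > 0 ? totalCpuTimeUs / invocations / 1000 : undefined;
}

// Drop NaN/Infinity fields during line-protocol serialization so one bad
// value cannot poison a whole batched write.
function serializeFields(fields: Record<string, number>): string {
  return Object.entries(fields)
    .filter(([, value]) => Number.isFinite(value))
    .map(([key, value]) => `${key}=${value}`)
    .join(",");
}
```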
…-deploy PATCH

the cloudflare_worker_version resource does support usage_model (deprecated but functional); setting it on the version itself is the durable fix. the prior post-deploy service-env PATCH was unreliable — after commit 3eda62b the worker ran standard for ~45 min then reverted to bundled on its own, causing 2+ hours of exceededCpu crons.
empirically verify what usage_model new cloudflare workers default to at the version level. scheduled handler burns >50ms of cpu so if the default is `bundled` (50ms cap) the cron will die with `exceededCpu`, and if `standard` it completes with `outcome=ok`. no usage_model field is set on cloudflare_worker_version in this commit — a follow-up will add `usage_model = "standard"` to confirm the fix.
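A sketch of the kind of scheduled handler the cpu-test worker needs. The busy-loop approach is an assumption (the commit only states the handler burns >50ms of CPU); note that in Workers Date.now() does not advance during pure compute, so the loop is bounded by iterations rather than wall time:

```ts
// cpu-test sketch: burn CPU well past the bundled 50ms cap. Under
// usage_model=bundled the cron should die with exceededCpu; under
// standard it should complete with outcome=ok.
export default {
  async scheduled(_controller: ScheduledController, _env: unknown, _ctx: ExecutionContext) {
    let sink = 0;
    // Iteration count chosen to burn on the order of seconds of CPU,
    // comfortably past 50ms on any isolate.
    for (let i = 0; i < 50_000_000; i++) {
      sink += Math.sqrt(i);
    }
    console.log(`cpu-test finished, sink=${sink}`); // visible via wrangler tail
  },
};
```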
commit a proved new workers default to bundled (50ms cap) — first cron on cpu-test-api-dev-pr-28 fired exceededCpu at cpu=50ms wall=51ms. add usage_model = "standard" to cloudflare_worker_version and re-verify via wrangler tail that cron outcome flips to ok with cpu > 50ms.
experiment confirmed empirically:

- new cloudflare workers deployed via terraform default to usage_model=bundled (50ms cpu cap). cpu-test first cron: exceededCpu at cpu=50ms wall=51ms.
- setting usage_model="standard" on cloudflare_worker_version lifts the cap. same handler, new version: cpu=2050ms wall=2105ms — 41x the bundled cap.
- the worker has been deleted from cloudflare via the api; postgres tf state schema services_cf_workers_cpu-test_dev_pr-28 is orphaned but harmless (module dir removed so terragrunt run-all won't discover it).
re-create cpu-test worker with usage_model=standard + limits.cpu_ms=30000 to see if it also reverts to bundled after some time. bump cloudflare-metrics source to force a new version (the current one reverted to bundled after ~2 hours of stable operation). will poll the usage_model field to catch exactly when it flips.
previous deploy (ea725556) at 21:58 UTC reverted to bundled at 23:06 UTC (~1h8m post-deploy). bump FORCE_NEW_VERSION to trigger a new version and confirm both (a) redeploy fixes the 50ms cap immediately, and (b) the ~1 hour revert pattern repeats on the new version.
Force-pushed 5a61fc0 to 4fd3d95
- prettier-format resource-cache.test.ts
- pass explicit undefined to normalizeTagValue to satisfy TS2554
- add --passWithNoTests to cpu-test vitest (experimental worker has no tests)
Force-pushed 4fd3d95 to b53dfc5
- delete apps/cpu-test and its terraform module — experiment complete
- drop FORCE_NEW_VERSION redeploy trigger from cloudflare-metrics
Adds a new Cloudflare Worker that runs on a 5-minute cron, queries the
Cloudflare GraphQL Analytics API for every resource type we currently
use (and several we don't yet), and pushes the data into VictoriaMetrics
via the existing InfluxDB line-protocol endpoint.
What's collected
20 datasets, each mapped to a cf_* measurement with snake_case tags and fields:

- cf_workers_invocations, cf_workers_subrequests, cf_workers_overview
- cf_d1_queries, cf_d1_storage
- cf_r2_operations, cf_r2_storage
- cf_kv_operations, cf_kv_storage
- cf_durable_objects_invocations, cf_durable_objects_periodic, cf_durable_objects_storage, cf_durable_objects_sql_storage, cf_durable_objects_subrequests
- cf_queue_operations, cf_queue_backlog
- cf_hyperdrive_queries, cf_hyperdrive_pool
- cf_http_requests_overview
- cf_pages_functions_invocations

All points are tagged with account_id and written with the Cloudflare bucket timestamp so historical backfills land in the right place.
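For reference, one row serialized to line protocol might look like the sketch below; tag-value escaping and the real Metric type are omitted, and only the account_id tag and bucket-timestamp behavior are from the text above:

```ts
// Serialize one analytics row: measurement, snake_case tags, numeric fields,
// and the Cloudflare bucket timestamp in nanoseconds so backfilled points
// land on their original minute. Tag/field escaping is omitted here.
function toLineProtocol(
  measurement: string,
  tags: Record<string, string>,
  fields: Record<string, number>,
  bucketTime: Date,
): string {
  const tagStr = Object.entries(tags).map(([k, v]) => `${k}=${v}`).join(",");
  const fieldStr = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(",");
  const ns = BigInt(bucketTime.getTime()) * 1_000_000n;
  return `${measurement},${tagStr} ${fieldStr} ${ns}`;
}

// toLineProtocol("cf_workers_invocations",
//   { account_id: "abc123", script_name: "version-api-prod" },
//   { requests: 2950 }, new Date("2025-01-01T07:17:00Z"))
// -> "cf_workers_invocations,account_id=abc123,script_name=version-api-prod requests=2950 <ns>"
```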
Structure
Mirrors the existing version worker:

- src/metrics.ts — extends the shared pattern with floatField and a custom export timestamp so analytics values don't get truncated to integers.
- src/graphql-client.ts — typed wrapper over the Cloudflare GraphQL API using a single JSON filter variable (works around the per-dataset filter input types).
- src/datasets.ts — single registry describing every dataset's dimensions, aggregation blocks, and tag/field projection. Adding a new dataset is one entry (sketched below).
- src/collector.ts — fetches each dataset, converts rows to Metric points, and records a self-observation per dataset (cloudflare_metrics_collector_dataset).
- src/index.ts — fetch handler for /health and /collect (manual trigger) plus the
scheduled() cron entry point.
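To make the one-entry-per-dataset design concrete, a hypothetical registry entry and the shared variable builder; the real DatasetSpec shape, GraphQL node names, and filter keys may differ, and only the single-JSON-filter design is from the description above:

```ts
// Hypothetical shape of a datasets.ts entry: dimensions, aggregation, and
// the tag/field projection live in one declarative object per dataset.
interface DatasetSpec {
  measurement: string;            // cf_* measurement name
  node: string;                   // GraphQL node under viewer.accounts
  dimensions: string[];           // grouping dimensions to request
  tags: Record<string, string>;   // row key -> snake_case tag name
  fields: Record<string, string>; // row key -> field name
}

const workersInvocations: DatasetSpec = {
  measurement: "cf_workers_invocations",
  node: "workersInvocationsAdaptive",
  dimensions: ["scriptName", "status"],
  tags: { scriptName: "script_name", status: "status" },
  fields: { requests: "requests", errors: "errors", subrequests: "subrequests" },
};

// Every dataset query shares one untyped JSON filter variable, sidestepping
// the per-dataset *FilterInput types the description mentions.
function buildVariables(accountId: string, start: Date, end: Date) {
  return {
    accountTag: accountId,
    filter: {
      datetime_geq: start.toISOString(), // filter keys are illustrative
      datetime_lt: end.toISOString(),
    },
  };
}
```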
Testing

Unit tests (pnpm run test): line protocol formatting, query builder, variable builder, GraphQL client error paths, collector
dimension/field projection, dataset registry invariants, HTTP handler.
Integration tests (pnpm run test:integration, gated on CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID): every dataset query is executed against the real Cloudflare API and validated to match the
parser's expected shape, plus a full collector run. All 20 datasets
succeed; the first run against our production account emitted
11,330 metric points.
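The env gate presumably follows the usual vitest pattern; a sketch with the suite contents elided (only the two gating variables are from the text above):

```ts
import { describe, expect, it } from "vitest";

// Integration tests only run when real credentials are present; CI and
// local runs without the secrets skip the whole suite.
const token = process.env.CLOUDFLARE_API_TOKEN;
const accountId = process.env.CLOUDFLARE_ACCOUNT_ID;

describe.skipIf(!token || !accountId)("datasets against the live GraphQL API", () => {
  it("matches the parser's expected row shape", async () => {
    // ...execute each registered dataset query here with the real token...
    expect(token).toBeDefined();
  });
});
```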
Deployment
New Terraform module at deployment/modules/cloudflare/workers/cloudflare-metrics/:

- api-token.tf — provisions a scoped read-only Cloudflare API token with "Account Analytics Read" permission via cloudflare_api_token.
- worker.tf — worker, version, deployment, and a */5 * * * * cron trigger.

No custom domain — the worker is only triggered by cron.